Please refamilarise yourself with workshop 2 before proceeding. Availible on my github or on blackboard.
Setup
Load packages
library(tidyverse)
library(readxl)
library(cowplot)
library(ggbeeswarm)
library(EnhancedVolcano)
library(ggpubr)
Public datasets
Typically data from flow cytometery experiments don’t get uplifted to public datasets. It is much more common to upload proteomic data transcriptomic data. This partly reflects that journals have requirements that data from these experiments should be uploaded to databases such as NCBI. This has led to a transformation in modern biology - with so much data freely available, it is often unnesscary to even do an experiment! More realistically, experiments are still critical but these datsets are used as a form of hypothesis generation. Allowing scientists to test ideas in silico before spending resourses in the lab.
Here we will download a public dataset from NCBI and reanalyse part of that datset. We will use data generated by a microarray which is a well established, if ageing, method of measuring RNA transcripts in a tissue sample. It uses probes to detect mRNA molecules isolated from cells.
The data generated from microarray (and RNA-seq) is typically put through a differential gene expression analysis pipline which compare two groups of samples (e.g. control group, manipulated group) to produce data which contains a fold change and statistical tests for each gene in the array (often 10 of 1000s of genes).
Downlaod the data
So we will use a dataset from a paper I liked. This is a complicated paper but we want to take a take a small subset of the data and use it. This is for a macrophage in the peritoneal cavity called a large peritoneal macrophage (LPM). This is similar to the LCM in my dataset above.
We want to compare the transcriptome between LPM from WT mice given no stimulus and WT LPM given IL-4 complex a cytokine which activates type 2 immune responses and in macrophages is said to lead to ‘alternative activation’ (M2, M(IL-4) activation)
The paper has a link to the data in the methods section of the paper: GSE125727.
This brings us to a NCBI page for the datasets. For microarray datasets we can easily reanalyse these data using an R script built into the website called “Geo2R”. Analysis of RNA-sew data is a little more involved, but we will ignore that in this course.
I have highlighted the relevant samples. To start click geo2R.
On the next page, we define our groups. Do this by clicking ‘define groups’ typing 2 names, “naive” and “IL-4c”. To assign samples to those groups, cntrl or shift click the relevant sameps and then click on those groups.
How to assign groups
Then scroll down to the bottom of the page and click “Analyze”. This will tell the server to run the r script using the argument you provided (group information). You can actually view the R script if you like, or even copy it into R to run it on your own computer. Although you will need to install more packages, and the code might be a bit over your head at this stage. The analysis is a version of differential gene expression which looks at genes which are up or down in two groups.
This will take a couple of minutes to run, but eventually you will get a lot of outputs. What it will produce is a table of genes as rows, with various columns including 2 we are interested in, Fold change and P values.
Super cool outputs!
You can download these pictures if you like but what you really want is the table file. So select ‘download full table.’ This will download the a tsv file. A tsv file is like a csv file expect the data is separated by tabs not commas.
Move the file into the data folder for this project.
load in data
Here we will load in the data (assuming you were able to download it and put it in the right place). Unlike before we have to tell R that the tabs separate the data so the function needs another argument, sep="\t"
Your dataset will have a different name so you will have to adjust the file name and location. Hint: use tab complete inside "".
micro <- read.csv(file = "data/GSE125727.top.table.tsv", sep = "\t")
Lets investigate the daraframe
dim(micro)
[1] 30177 8
So we have a lot or rows and 8 columns. Let’s look at the columns first.
colnames(micro)
[1] "ID" "adj.P.Val" "P.Value" "t" "B"
[6] "logFC" "Gene.symbol" "Gene.title"
Hmmm, this looks fairly understandable. The cols we will be interested in are "logFC’ or log base 2 of fold change, and adjusted P value. We could select only that data to simplify matters.
However the gene symbols may change, because this data is supplied by the original scicentists and they may have come up with their own name. We next print out the table to see if we can spot gene names, which should be in the format of Gapdh for mouse and GAPDH for huamn.
(head((micro)))
ID adj.P.Val P.Value t B logFC Gene.symbol
1 10475517 2.28e-07 7.55e-12 -67.51334 15.19131 -7.661585 AA467197
2 10419568 9.67e-07 6.41e-11 -50.94326 14.26932 -8.391100 Rnase2a
3 10501026 2.96e-06 2.94e-10 -41.67332 13.41073 -5.504030 Chil4
4 10476945 4.65e-06 7.88e-10 -36.58754 12.76415 -5.747173 Cst7
5 10534927 4.65e-06 8.49e-10 36.22528 12.71189 3.885703 Pilra
6 10378793 4.65e-06 1.04e-09 -35.27119 12.56978 -4.028273 Tmigd1
Gene.title
1 expressed sequence AA467197
2 ribonuclease, RNase A family, 2A (liver, eosinophil-derived neurotoxin)
3 chitinase-like 4
4 cystatin F (leukocystatin)
5 paired immunoglobin-like type 2 receptor alpha
6 transmembrane and immunoglobulin domain containing 1
Sow where is your Gene symbols? What is the column title, does it have no title?
If this is the case you can adapt the following code to rename your variable to Gene.symbol
# remove "#" to to run lines of code
# micro$Gene.symbol <- micro$[the namme of your col containing gene symbols]
# common examples below
#micro$Gene.symbol <- micro$ORF
micro <- micro %>% select(c("P.Value","logFC","Gene.symbol"))
Now our dataset micro contains only three cols. The dataset contains 30177 rows, likely corresponding to genes (in reality they correspond to probes not genes, so the same gene may appear multiple times). We would not write out 30177 genes but we can use the head function to look at the first few.
head(micro$Gene.symbol)
[1] "AA467197" "Rnase2a" "Chil4" "Cst7" "Pilra" "Tmigd1"
Cool! They look like genes to me! Chil4 is one of my favourite mouse genes too!
volcano plot
We will produce a basic volcano plot in R using ggplot2. We will need one more R package called ggrepel. Install this using install.packages("ggrepel").
library(ggrepel)
The most basic volcano plot can be made with the following code:
micro %>%
ggplot(aes(x =`logFC`,
y=-log(`P.Value`) # do you notice the -log function?
)) +
geom_point() + theme_cowplot()
We can of course make this look better! Lets creat a new col that that can be used to highlight certain highly differential genes. The code from here on down might be a bit more complicated.
You can change the p value cutoffs here if you like reducing or increasing them to suit your dataset
If your dataset is very different and the values are not that different, you may want to reduce the LofFC >= 1 or something similar. Just because you don’t have log2FC of +5 or -7 doesn’t mean there isn’t differences, perhaps the amount of mRNA captured was less or the difference between the groups was subtle (ofthen the case for KO cells when the cell type is the same, but one gene is lost.)
# mark up/down regulated genes
micro <- micro %>%
mutate(gene_type = case_when(logFC >= 1 & P.Value <= (0.05) ~ "up",
logFC <= -1 & P.Value <= (0.05) ~ "down",
TRUE ~ "ns"))
So what did that create? Let’s look
head(micro$gene_type)
[1] "down" "down" "down" "down" "up" "down"
We will also create some variable that we can use for advanced plotting
# Add colour, size and alpha (transparency) to volcano plot --------------------
cols <- c("up" = "#ffad73", "down" = "#26b3ff", "ns" = "grey")
sizes <- c("up" = 2, "down" = 2, "ns" = 1)
alphas <- c("up" = 1, "down" = 1, "ns" = 0.4)
micro %>%
ggplot(aes(x =`logFC`,
y=-log(`P.Value`),
fill = gene_type,
size = gene_type,
alpha = gene_type)) +
ggtitle('My title') +
xlab('Log2 Fold Change') +
geom_point(shape = 21, # Specify shape and colour as fixed local parameters
colour = "black") +
geom_hline(yintercept = -log10(0.001),
linetype = "dashed") +
geom_vline(xintercept = c(-2, 2),
linetype = "dashed") +
scale_fill_manual(values = cols) + # Modify point colour
scale_size_manual(values = sizes) + # Modify point size
scale_alpha_manual(values = alphas) + # Modify point transparency
theme_cowplot() +
theme(legend.position = 'none')
That looks nicer. I won’t describe each line above, but you can read each line and see if you can work out what it does. You can comment out a line in the code using a hash
# at the start of a line to stop the code being run, but without deleting it. Try commenting out various lines to see what that does.
Adding text to the volcano plot
The last thing we want to do is add text to the graph. Now we do not want to label all of the genes or our plot will look terrible and take hours to run. We just want to highlight the top few differential gene and highlight them on the graph.
First lets select the most differential genes. First arrange by LogFC ignoring minus sign using the abs() function and the desc() to place them in decending order.
micro <- micro %>%
arrange( desc(abs(logFC)))
head(micro, 20)
P.Value logFC Gene.symbol gene_type
1 6.41e-11 -8.391100 Rnase2a down
2 7.55e-12 -7.661585 AA467197 down
3 8.17e-09 -6.095598 Mgl2 down
4 7.88e-10 -5.747173 Cst7 down
5 5.79e-09 -5.659000 Chil3 down
6 2.94e-10 -5.504030 Chil4 down
7 1.66e-09 -5.436653 Timd2 down
8 4.82e-08 5.359922 Fpr1 up
9 8.15e-08 5.320450 Cxcl13 up
10 2.17e-07 -5.071113 Ucp1 down
11 1.98e-07 -5.016402 Klk1b11 down
12 2.40e-08 -4.988765 I830127L07Rik down
13 2.50e-08 -4.988538 Mall down
14 1.03e-07 -4.849938 Gm5150 down
15 3.07e-09 -4.813952 Spp1 down
16 6.80e-09 -4.772435 Gatm down
17 7.35e-08 4.520842 Cxcl2 up
18 7.51e-07 4.516250 Ccl3 up
[ reached 'max' / getOption("max.print") -- omitted 2 rows ]
Now get the 20 top genes here.
genes_to_graph <- head(micro$Gene.symbol, 16)
What if we wanted to add some genes we like to this? Well we can just append the vector with some more gene symbols
Next we want to create a new object which is a subset of micro that contains only the genes we are interested in. This is useful for telling ggplot to label only specific points. There are actually many ways of doing this and this is atually not a very intuitive method, but it works.
identify <- micro %>% filter(Gene.symbol %in% genes_to_graph )
Finally we graph this.
micro %>%
ggplot(aes(x =-`logFC`,
y=-log(`P.Value`),
fill = gene_type,
size = gene_type,
alpha = gene_type)) +
ggtitle('My title') + # cjamge plot title
xlab('Log2 Fold Change') + # x axis label
geom_point(shape = 21, # Specify shape and colour as fixed local parameters
colour = "black") +
geom_hline(yintercept = -log10(0.001), # horizontal line
linetype = "dashed") +
geom_vline(xintercept = c(-2, 2), #vertical line
linetype = "dashed") +
scale_fill_manual(values = cols) + # Modify point colour
scale_size_manual(values = sizes) + # Modify point size
scale_alpha_manual(values = alphas) + # Modify point transparency
geom_label_repel(data = identify,size=5, # Add labels last to appear as the top layer
aes(label = Gene.symbol),
force = 3,
nudge_y = 1,max.overlaps = 200) +
theme_cowplot(font_size = 18) +
theme(legend.position = 'none')
ggsave(filename = "volcano_.png",width = 8,height = 5,dpi = 300)
Try to reverse engineer some of that code, removing some lines and changing values.
Advanced analysis
The following packages maybe required for the more advanced analysis in using Jamie’s code or if you want to run geo2r’s R script yourself
You can find the Rscript here
Just copy this into R and run it and it should work
load additonal packages
These can be installed with the functions
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("GEOquery") BiocManager::install("limma") install.packages(“umap”) install.packages("RcolorBrewer") install.packages("ggrepel") install.packages("ggfortify")
library(GEOquery)
library(limma)
library(RColorBrewer)
library(pheatmap)
library(umap)
library(ggrepel)
library(ggfortify)
#if you get errors try running this code
#Sys.setenv("VROOM_CONNECTION_SIZE" = 131072 * 4)
#readr::local_edition(1)
Of course you are welcome to produce your own analysis! Just google mciroarray analysis in R.
Assessment
Each group will be given a paper.
In each paper there is a microarray dataset. The GSE number for the data set is in the paper. Search for this dataset on the NCBI website.
You are asked to analyse the dataset using the online GEO2R tool. You must define groups for this to work. It is critical that you define only 2 groups. Where more than two groups exist use your judgment from reading the paper as to which group comparison is relevant. It might be more interesting to choose groups which are very different than each other, e.g. stim vs no stim. Please download your table as a csv file by clicking on ‘Download full table’. Save this file as ‘x.csv’
Read this into R and produce some analysis. E.g. volcano plot.